# LIKWID: Lightweight Performance Tools

Jan Treibig, Georg Hager, and Gerhard Wellein

Abstract. Exploiting the performance of today's microprocessors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command line utilities that addresses four key problems: probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter metrics, and microbenchmarking for reliable upper performance bounds. Moreover, it includes an mpirun wrapper allowing for portable thread-core affinity in MPI and hybrid MPI/threaded applications. To demonstrate the capabilities of the tool set, we show the influence of thread affinity on performance using the well-known OpenMP STREAM triad benchmark, use hardware counter tools to study the performance of a stencil code, and finally show how to detect bandwidth problems on ccNUMA-based compute nodes.

### 1 Introduction

Today's multicore x86 processors bear multiple complexities when aiming for high performance. Conventional performance tuning tools like Intel VTune, OProfile, CodeAnalyst, OpenSpeedshop, etc., require a lot of experience in order to get sensible results. For this reason they are usually unsuitable for scientific users, who would often be satisfied with a rough overview of the performance properties of their application code. Moreover, advanced tools often require kernel patches and additional software components, which make them unwieldy and bug-prone. Additional confusion arises with the complex multicore, multicache, multisocket structure of modern systems (see Fig. 1); users are all too often at a loss about how hardware thread IDs are assigned to resources like cores, caches, sockets and NUMA domains. Moreover, the technical details of how threads and processes are bound to those resources vary strongly across compilers and MPI libraries.

J. Treibig · G. Hager · G. Wellein

Erlangen Regional Computing Center (RRZE), Friedrich-Alexander Universität Erlangen-Nürnberg, Martensstr. 1, D-91058 Erlangen, Germany

e-mail: {jan.treibig,georg.hager,gerhard.wellein}@rrze.uni-erlangen.de

LIKWID ("Like I Knew What I'm Doing") is a set of easy to use command line tools to support optimization. It is targeted towards performance-oriented programming in a Linux environment, does not require any kernel patching, and is suitable for Intel and AMD processor architectures. Multi-threaded and even hybrid shared/distributed-memory parallel code is supported. LIKWID comprises the following tools:

- likwid-features can display and alter the state of the on-chip hardware prefetching units in Intel x86 processors.
- likwid-topology probes the hardware thread and cache topology in multicore, multisocket nodes. Knowledge like this is required to optimize resource usage like, e.g., shared caches and data paths, physical cores, and ccNUMA locality domains in parallel code.
- likwid-perfctr measures performance counter metrics over the complete runtime of an application or, with support from a simple API, between arbitrary points in the code. Although it is possible to specify the full, hardware-dependent event names, some predefined event sets simplify matters when standard information like memory bandwidth or Flop counts is needed.
- likwid-pin enforces thread-core affinity in a multi-threaded application "from the outside," i.e., without changing the source code. It works with all threading models that are based on POSIX threads, and is also compatible with hybrid "MPI+threads" programming. Sensible use of likwid-pin requires correct information about thread numbering and cache topology, which can be delivered by likwid-topology (see above).
- likwid-mpirun allows pinning a pure MPI or hybrid MPI/threaded application to dedicated compute resources in an intuitive and portable way.
- likwid-bench is a microbenchmarking framework allowing rapid prototyping of small assembly kernels. It supports threading, thread and memory placement, and performance measurement. likwid-bench comes with a wide range of typical benchmark cases and can be used as a stand-alone benchmarking application.

Although the six tools may appear to be partly unrelated, they solve the typical problems application programmers encounter when porting and running their code on complex multicore/multisocket environments. Hence, we consider it a natural idea to provide them as a single tool set.

This paper is organized as follows. Section 2 describes two of the tools in some detail and gives hints for typical use. Section 3 demonstrates the use of LIKWID in three different case studies, and Section 4 gives a summary and an outlook to future work.



Fig. 1: Cache and thread topology of Intel Core 2 Quad and Nehalem EP Westmere processors.

### 2 Tools

LIKWID only supports x86-based processors. Given the strong prevalence of those architectures in the HPC market (e.g., 90% of all systems in the latest Top 500 list are of x86 type) we do not consider this a severe limitation. In other areas like, e.g., workstations or desktops, the x86 dominance is even larger.

An important concept shared by all tools in the set is logical numbering of compute resources inside so-called *thread domains*. Under the Linux OS, hardware threads in a compute node are numbered according to some scheme that heavily depends on the BIOS and kernel version, and which may be unrelated to natural topological units like cache groups, sockets, etc. Since users naturally think in terms of topological structures, LIKWID introduces a simple and yet flexible syntax for specifying processor resources. This syntax consists of a prefix character and a list of logical IDs, which can also include ranges. The following domains are supported:

| Thread domain           | Prefix |
|-------------------------|--------|
| Node                    | N      |
| Socket                  | S[0-9] |
| Last level shared cache | C[0-9] |
| NUMA domain             | M[0-9] |

Multiple ID lists can be combined, allowing a flexible numbering of compute resources. To indicate, e.g., the first two cores of NUMA domains 1 and 3, the following string can be used: M0:0,1@M2:0,1.
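To make the syntax concrete, the following minimal Python sketch shows one way such a string could be decomposed into domains and logical IDs. It is not LIKWID's own parser: the names `parse_id_list` and `parse_thread_domains`, the returned data layout, and the acceptance of multi-digit domain numbers are assumptions made purely for illustration.

```python
import re

# Hypothetical helpers, not part of LIKWID. A domain is either "N" (node)
# or one of the prefixes S, C, M followed by a number (assumed: any number
# of digits is accepted here).
_DOMAIN = re.compile(r"^(N|[SCM]\d+)$")


def parse_id_list(text):
    """Expand a comma-separated ID list with optional ranges, e.g. '0,2-4'."""
    ids = []
    for part in text.split(","):
        if "-" in part:
            first, last = part.split("-")
            ids.extend(range(int(first), int(last) + 1))
        else:
            ids.append(int(part))
    return ids


def parse_thread_domains(expression):
    """Return one (domain, logical IDs) tuple per '@'-separated entry."""
    result = []
    for entry in expression.split("@"):
        domain, _, id_list = entry.partition(":")
        if not _DOMAIN.match(domain):
            raise ValueError(f"unknown thread domain: {domain!r}")
        result.append((domain, parse_id_list(id_list)))
    return result


print(parse_thread_domains("M0:0,1@M2:0,1"))
print(parse_thread_domains("N:0-3"))
```

The IDs returned here are logical IDs inside each domain; translating them to the hardware thread IDs used by the operating system requires the topology information that likwid-topology provides and is outside the scope of this sketch.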

In the following we describe two of the six tools in more detail. Thorough documentation of all tools, beyond the man pages, can be found on the wiki pages of the LIKWID homepage [1].

#### 2.1 likwid-perfctr

Hardware-specific optimization requires an intimate knowledge of the microarchitecture of a processor and the characteristics of the code. While many problems can be solved with profiling, common sense, and runtime measurements, additional information is often useful to get a complete picture.

Performance counters are facilities to count hardware events during code execution on a processor. Since this mechanism is implemented directly in hardware there is no overhead involved. All modern processors provide hardware performance counters. They are attractive for application programmers, because they allow an in-depth view on what happens on the processor while running applications. As shown below, likwid-perfctr has practically zero overhead since it reads performance metrics at predefined points. It does not support statistical counter sampling. At the time of writing, likwid-perfctr runs on all current x86-based architectures.

Probably the best-known and most widespread existing tool is the PAPI library [6, 5]. A lot of research is targeted towards using hardware counter data for automatic analysis and detecting potential performance bottlenecks [2, 3, 4]. However, those solutions are often too unwieldy for the common user, who would prefer a quick overview as a first step in performance analysis. Key design goals for likwid-perfctr were ease of installation and use, minimal system requirements (no additional kernel modules and patches), and, at least for basic functionality, no changes to the user code. A prototype for the development of likwid-perfctr was the SGI tool "perfex," which was available on MIPS-based IRIX machines as part of the "SpeedShop" performance suite. Cray provides a similar, PAPI-based tool (craypat) on their systems [8].

likwid-perfctr is a dedicated command line tool for programmers, allowing quick and flexible measurement of hardware performance counters on x86 processors, and is available as open source. It allows simultaneous measurements on multiple cores. Events that are shared among the cores of a socket (this pertains to the "uncore" events on Core i7-type processors) are supported via "socket locks," which enforce that all uncore event counts are assigned to one thread per socket. Events are specified on the command line, and the number of events to count concurrently is limited by the number of performance counters on the CPU. These features are available without any changes in the user's source code. A small instrumentation ("marker") API allows one to restrict measurements to certain parts of the code (named regions) with automatic accumulation over all regions of the same name.

An important difference to most existing performance tools is that event counts are strictly core-based instead of process-based: Everything that runs and generates events on a core is taken into account; no attempt is made to filter events according to the process that caused them. The user is responsible for enforcing appropriate affinity to get sensible results. This can be achieved with likwid-perfctr itself or alternatively via likwid-pin (see below for more information):

```
$ likwid-perfctr -C S0:0 \
  -g SIMD_COMP_INST_RETIRED_PACKED_DOUBLE:PMC0,SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE:PMC1 \
  ./a.out
```

(See below for typical output in a more elaborate setting.) In this example, the computational double precision packed and scalar SSE retired instruction counts on an Intel Core 2 processor are assigned to performance counters 0 and 1 and measured on the first core (ID 0) of the first socket (domain S0) over the duration of a.out's runtime. As a side effect, it becomes possible to use likwid-perfctr as a monitoring tool for a complete shared-memory node, just by specifying all cores for measurement and, e.g., "sleep" as the application:

```
$ likwid-perfctr -c N:0-7 \
  -g SIMD_COMP_INST_RETIRED_PACKED_DOUBLE:PMC0,SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE:PMC1 \
  sleep 1
```

Apart from naming events as they are documented in the vendor's manuals, it is also possible to use preconfigured *event sets* (groups) with derived metrics. This provides a simple abstraction layer in cases where standard information like memory bandwidth, Flops per second, etc., is sufficient:

```
$ likwid-perfctr -C N:0-3 -g FLOPS_DP ./a.out
```

The event groups are partly inspired by a technical report published by AMD [7], and all supported groups can be listed with the -a command line switch. We try to provide the same preconfigured event groups on all supported architectures, as long as the native events support them. This allows the beginner to concentrate on the useful information right away, without the need to look up events in the manuals (similar to PAPI's high-level events). In the usage scenarios described so far there is no interference of likwid-perfctr while user code is being executed, i.e., the overhead is very small (apart from the unavoidable API call overhead in marker mode).

The following example illustrates the use of the marker API in a serial program with two named regions ("Main" and "Accum"):

```
#include <likwid.h>
...
int coreID = likwid_processGetProcessorId();
likwid_markerInit(numberOfThreads,numberOfRegions);
int MainId = likwid_markerRegisterRegion("Main");
int AccumId = likwid_markerRegisterRegion("Accum");
likwid_markerStartRegion(0, coreID);
// measured code region "Main" here
likwid_markerStopRegion(0, coreID, MainId);

for (j = 0; j < N; j++) {
    likwid_markerStartRegion(0, coreID);
    // measured code region "Accum" here
    likwid_markerStopRegion(0, coreID, AccumId);
}
likwid_markerClose();
```

Event counts are automatically accumulated on multiple calls. Nesting or partial overlap of code regions is not allowed. The API requires specification of a thread ID (0 for one process only in the example) and the core ID of the thread/process. The LIKWID API provides simple functions to determine the core ID of processes or threads. The following listing shows the shortened output of likwid-perfctr after measurement of the FLOPS\_DP event group on four cores of an Intel Core 2 Quad processor in marker mode with two named regions ("Init" and "Benchmark," respectively):

```
$ likwid-perfctr -c 0-3 -g FLOPS_DP -m ./a.out
Intel Core 2 45nm processor
Measuring group FLOPS_DP
Region: Init
| Event                 | core 0      | core 1      | core 2      | core 3      |
| INSTR_RETIRED_ANY     | 313742      | 376154      | 355430      | 341988      |
| CPU_CLK_UNHALTED_CORE | 217578      | 504187      | 477785      | 459276      |
...
| Metric                | core 0      | core 1      | core 2      | core 3      |
| Runtime [s]           | 7.67906e-05 | 0.000177945 | 0.000168626 | 0.000162094 |
| CPI                   | 0.693493    | 1.34037     | 1.34424     | 1.34296     |
| DP MFlops/s           | 0.0130224   | 0.00561973  | 0.00593027  | 0.00616926  |
Region: Benchmark
| Event                 | core 0      | core 1      | core 2      | core 3      |
| INSTR_RETIRED_ANY     | 1.88024e+07 | 1.85461e+07 | 1.84947e+07 | 1.84766e+07 |
...
| CPI                   | 1.52023     | 1.52252     | 1.52708     | 1.52661     |
| DP MFlops/s           | 1624.08     | 1644.03     | 1643.68     | 1645.8      |
```

Note that the INSTR\_RETIRED\_ANY and CPU\_CLK\_UNHALTED\_CORE events are always counted (using two nonassignable "fixed counters" on the Core 2 architecture), so that the derived CPI metric ("cycles per instruction") is easily obtained.
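As a small worked example of such a derived metric, the CPI value can be recomputed by hand from the two fixed-counter events. The Python sketch below uses the core 0 counts of the "Init" region from the listing above; the function name is an illustrative assumption and not part of LIKWID.

```python
def cycles_per_instruction(clk_unhalted_core, instr_retired_any):
    """CPI = unhalted core clock cycles divided by retired instructions."""
    if instr_retired_any == 0:
        raise ValueError("no instructions retired")
    return clk_unhalted_core / instr_retired_any


# Core 0, region "Init": CPU_CLK_UNHALTED_CORE and INSTR_RETIRED_ANY counts
cpi = cycles_per_instruction(217578, 313742)
print(f"CPI = {cpi:.6f}")
```

The result agrees with the value of 0.693493 that likwid-perfctr reports for core 0 in the "Init" region.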

#### 2.2 likwid-pin

Thread/process affinity is vital for performance. If topology information is available, it is possible to "pin" threads according to the application's resource requirements like bandwidth, cache sizes, etc. Correct pinning is even more important on processors supporting SMT, where multiple hardware threads share resources on a single core. likwid-pin supports thread affinity for all threading models that are based on POSIX threads, which includes most OpenMP implementations. By overloading the pthread\_create API call with a shared library wrapper, each thread can be pinned in turn upon creation, working through a list of core IDs. This list, and possibly other parameters, are encoded in environment variables that are evaluated when the library wrapper is first called. likwid-pin simply starts the user application with the library preloaded.

Fig. 2: Basic architecture of likwid-pin.

This architecture is illustrated in Fig. 2. No code changes are required, but the application must be dynamically linked. This mechanism is independent of the processor architecture, but the way the compiled code creates application threads must be taken into account: For instance, the Intel OpenMP implementation always runs OMP\_NUM\_THREADS threads but uses the first newly created thread as a management thread, which should not be pinned. This knowledge must be communicated to the wrapper library. The following example shows how to use likwid-pin with an OpenMP application compiled with the Intel compiler:

```
$ export OMP_NUM_THREADS=4
$ likwid-pin -c N:0-3 -t intel ./a.out
```

In general, likwid-pin can be used as a replacement for the taskset tool, which cannot pin threads individually. Currently, POSIX threads, Intel OpenMP, and GNU (gcc) OpenMP are supported directly, and the latter is assumed as the default if the -t option is not used. A bit mask can be specified to identify the management threads for cases not covered by the available parameters to the -t option. Moreover, likwid-pin can also be employed for hybrid programs that combine MPI with some threading model, if the MPI process startup mechanism establishes a Linux cpuset for every process.
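The bookkeeping behind "pinning each thread in turn" with a skip mask can be illustrated with a short Python sketch. It only models the decision logic, not the actual wrapper library: the function name `plan_pinning`, the convention that bit i of the mask refers to the i-th created thread, and the wrap-around when the core list is exhausted are all assumptions made for this example.

```python
def plan_pinning(num_created_threads, core_ids, skip_mask=0x0):
    """Map each created thread (in creation order) to a core ID or None.

    Assumed convention for this sketch: if bit i of skip_mask is set,
    the i-th created thread is left unpinned.
    """
    plan = []
    next_slot = 0
    for thread_index in range(num_created_threads):
        if skip_mask & (1 << thread_index):
            plan.append(None)  # e.g. a management thread
            continue
        # Assumption: start over at the beginning of the list when there
        # are more threads to pin than core IDs.
        plan.append(core_ids[next_slot % len(core_ids)])
        next_slot += 1
    return plan


# First created thread is treated as a management thread and skipped.
print(plan_pinning(4, [0, 1, 2, 3], skip_mask=0x1))
# No skip mask, more threads than cores.
print(plan_pinning(3, [0, 1]))
```

How the initial thread of the process is placed is deliberately left out of this model; only threads created afterwards are considered.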

The big advantage of likwid-pin is its portable approach to the pinning problem, since the same tool can be used for most applications, compilers, MPI implementations, and processor types. In Section 3.1 the usage model is analyzed in more detail using the example of the STREAM triad.



Fig. 3: STREAM triad test run with the Intel C compiler on a dual-socket Intel Westmere system (six physical cores per socket). In Fig. (a) threads are not pinned and the Intel pinning mechanism is disabled. In Fig. (b) the application is pinned such that threads are equally distributed on the sockets to utilize the memory bandwidth in the most effective way. Moreover, the threads are first distributed over physical cores and then over SMT threads.

### 3 Case studies

#### 3.1 Case study 1: Influence of thread topology on STREAM triad performance

To illustrate the general importance of thread affinity we use the well-known OpenMP STREAM triad on an Intel Westmere dual-socket system. Intel Westmere is a hexacore design based on the Nehalem architecture and supports two SMT threads per physical core. The Intel C compiler version 11.1 was used with options -openmp -O3 -xSSE4.2 -fno-fnalias. Intel compilers support thread affinity only if the application is executed on Intel processors. The functionality of this topology interface is controlled by setting the environment variable KMP\_AFFINITY. In our tests KMP\_AFFINITY was set to disabled. For the case of the STREAM triad on these ccNUMA architectures the best performance is achieved if threads are equally distributed across the two sockets.

Figure 3 shows the results. The non-pinned case shows a large variance in performance, especially for the smaller thread counts, where the probability is large that only one socket is used. With larger thread counts there is a high probability that both sockets are used, but there is still a chance that cores are oversubscribed, which reduces performance. The pinned case consistently shows high performance throughout. It is apparent that the SMT threads of Westmere increase the chance of different threads fighting for common resources.



Fig. 4: Time-resolved results for the iteration phase of an MPI-parallel Lattice Boltzmann solver on one socket (four cores) of an Intel Nehalem compute node. The compute performance in MFlops/s (Fig. (a)) and the memory bandwidth in MBytes/s (Fig. (b)) are shown over a duration of 10 seconds, comparing two versions of the computational kernel (standard C versus SIMD intrinsics).

#### 3.2 Case study 2: Monitoring the performance of a Lattice Boltzmann fluid solver

To demonstrate the daemon mode option of likwid-perfctr, an MPI-parallel Lattice Boltzmann fluid solver was analyzed on an Intel Nehalem quad-core system (Fig. 4). The daemon mode of likwid-perfctr allows time-resolved measurements of counter values and derived metrics in performance groups. It is used as follows:

```
$ likwid-perfctr -c S0:0-3 -g FLOPS_DP -d 800ms
```

This command measures the performance group FLOPS\_DP on all physical cores of the first socket, with an interval of 800 ms between samples. likwid-perfctr will only read out the hardware monitoring counters and print the difference between the current and the previous measurement. Therefore, the overhead is kept to a minimum. For this analysis the performance groups FLOPS\_DP and MEM were used.
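The difference-based readout described above can be sketched in a few lines of Python. This is an illustrative model only, not the implementation in likwid-perfctr: the function names and the sample counter values are hypothetical, and effects such as counter overflow are ignored.

```python
def counter_deltas(readings):
    """Differences between consecutive cumulative counter readings."""
    return [curr - prev for prev, curr in zip(readings, readings[1:])]


def rates_per_second(readings, interval_s):
    """Convert per-interval event counts into events per second."""
    return [delta / interval_s for delta in counter_deltas(readings)]


# Hypothetical cumulative event counts, read out every 0.8 s.
samples = [0, 400, 1200, 2000]
print(counter_deltas(samples))
print(rates_per_second(samples, 0.8))
```

Because each sample only needs the previous reading, the monitoring process stays idle between readouts and does not have to interrupt the measured code.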

#### 3.3 Case study 3: Detecting ccNUMA problems on modern compute nodes

Many performance problems in shared memory codes are caused by an inefficient use of the ccNUMA memory organization on modern compute nodes.



Fig. 5: NUMA problems reproduced with likwid-bench on the example of a memory copy benchmark on an Intel Nehalem dual-socket quad-core machine. The bandwidth on the top is the effective total application bandwidth as measured by likwid-bench itself. The other bandwidth values are from likwid-perfctr measurements in wrapper mode.

ccNUMA technology achieves scalable memory size and bandwidth at the price of higher programming complexity: The well-known locality and contention problems can have a large impact on the performance of multi-threaded memory-bound programs if parallel first touch placement is not used on initialization loops, or is not possible for some reason [9].

likwid-perfctr supports the developer in detecting NUMA problems with two performance groups: MEM and NUMA. While on some architectures like, e.g., newer Intel systems, all events can be measured in one run using the MEM group, a separate group (NUMA) is necessary on AMD processors.

The example in Fig. 5 shows results for a memory copy benchmark, which is part of likwid-bench. Since likwid-bench allows easy control of thread and data placement it is well suited to demonstrate the capabilities of likwid-perfctr in detecting NUMA problems. Here, likwid-perfctr was used as follows:

```
$ likwid-perfctr -c S0:0@S1:0 -g MEM ./a.out
```

The relevant output for the derived metrics could look like this:

| Metric                     | core 0    | core 4   |
|----------------------------|-----------|----------|
| Runtime [s]                | 4.71567   | 0.138517 |
|                            | 16.4815   | 6998.71  |
| Remote Read BW [MBytes/s]  | 0.454106  | 4589.46  |
| Remote Write BW [MBytes/s] | 0.0705132 | 2289.32  |
| Remote BW [MBytes/s]       | 0.524619  | 6878.78  |

All threads were executed on socket zero, as can be seen from the runtime, which is based on the CPU\_CLK\_UNHALTED\_CORE metric. All program data originated from socket one since there is practically no local memory bandwidth. Hence, all bandwidth on socket one came from the remote socket.

Fig. 5 (a) shows the results for sequential data initialization on one socket; the overall bandwidth is 9.83 GB/s. Fig. 5 (b) shows the case with correct first touch data placement on both sockets. The effective bandwidth is 23.15 GB/s, and the scalable ccNUMA system is used in the most efficient way.

If an application cannot be easily changed to make use of the first touch memory policy, a viable compromise is often to switch to automatic round-robin page placement across a set of NUMA domains, or *interleaving*. likwid-pin can enforce interleaving for all NUMA domains included in a threaded run. This can be achieved with the -i option:

```
$ likwid-pin -c S0:0-3@S1:0-3 -t intel -i ./a.out
```

Figure 5 (c) reveals that the memory bandwidth achieved with interleaving policy, while not as good as with correct first touch, is still much larger than the bandwidth of case (a) with all data in one NUMA domain.
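The three placement scenarios of Fig. 5 can be contrasted with a toy model in Python. It is only a sketch of the policies, not of the operating system's page allocator: the function names, the page count, and the mapping of pages to touching threads are assumptions chosen for illustration.

```python
from collections import Counter


def first_touch(num_pages, domain_of_first_toucher):
    """Each page lands in the NUMA domain of the thread touching it first."""
    return [domain_of_first_toucher(page) for page in range(num_pages)]


def interleave(num_pages, domains):
    """Round-robin page placement across the given NUMA domains."""
    return [domains[page % len(domains)] for page in range(num_pages)]


num_pages = 8
domains = [0, 1]

# (a) Sequential initialization: one thread on domain 0 touches every page.
serial_init = first_touch(num_pages, lambda page: 0)
# (b) Parallel initialization: each domain's threads touch half of the pages.
parallel_init = first_touch(
    num_pages, lambda page: 0 if page < num_pages // 2 else 1
)
# (c) Interleaving: pages alternate between the domains.
interleaved = interleave(num_pages, domains)

for name, placement in [
    ("serial first touch", serial_init),
    ("parallel first touch", parallel_init),
    ("interleaved", interleaved),
]:
    print(name, placement, dict(Counter(placement)))
```

In this model, cases (b) and (c) both spread the pages evenly over the two domains, but only (b) keeps each page in the domain of the threads that work on it, which is consistent with interleaving falling between cases (a) and (b) in the measurements.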

### 4 Conclusion and future plans

LIKWID is a collection of command line applications supporting performance-oriented software developers in their effort to utilize today's multicore processors in an effective manner. LIKWID does not try to follow the trend to provide yet another complex and sophisticated tooling environment, which would be difficult to set up and would overwhelm the average user with large amounts of data. Instead it tries to make the important functionality accessible with as few obstacles as possible. The focus is put on simplicity and low overhead. likwid-topology and likwid-pin enable the user to account for the influence of thread and cache topology on performance and pin their application to physical resources in all possible scenarios with one single tool and no code changes. The usage of likwid-perfctr was demonstrated on two examples. LIKWID is open source and released under GPL2. It can be downloaded at http://code.google.com/p/likwid/.

Future plans include applying the philosophy of LIKWID to other areas like, e.g., profiling (also on the assembly level). Emphasis will also be put on a further improvement with regard to usability. It is also planned to port parts of LIKWID to the Windows operating system. An ongoing effort is to add support for present and upcoming architectures like, e.g., the Intel Sandy Bridge microarchitecture.

### Acknowledgment

We are indebted to Intel Germany for providing test systems and early access hardware for benchmarking. A special acknowledgment goes to Michael Meier, who had the basic idea for likwid-pin, implemented the prototype, and provided many useful thoughts in discussions. This work was supported by the Competence Network for Scientific and Technical High Performance Computing in Bavaria (KONWIHR) under the project "OMI4papps."

### References

1. Homepage of LIKWID tool suite: http://code.google.com/p/likwid/.
2. G. Jost, J. Haoqiang, J. Labarta, J. Gimenez, J. Caubet: Performance analysis of multilevel parallel applications on shared memory architectures. Proceedings of the Parallel and Distributed Processing Symposium, 2003.
3. M. Gerndt, E. Kereku: Automatic Memory Access Analysis with Periscope. ICCS '07: Proceedings of the 7th international conference on Computational Science, 2007, 847–854.
4. M. Gerndt, K. Fürlinger, E. Kereku: Periscope: Advanced Techniques for Performance Analysis. PARCO, 2005, 15–26.
5. D. Terpstra, H. Jagode, H. You, J. Dongarra: Collecting Performance Data with PAPI-C. Proceedings of the 3rd Parallel Tools Workshop, Springer Verlag, Dresden, Germany, 2010.
6. S. Browne, C. Deane, G. Ho, P. Mucci: PAPI: A Portable Interface to Hardware Performance Counters. Proceedings of Department of Defense HPCMP Users Group Conference, June 1999.
7. P. J. Drongowski: Basic Performance Measurements for AMD Athlon 64, AMD Opteron and AMD Phenom Processors. Technical Note, Advanced Micro Devices, Inc. Boston Design Center, September 2008.
8. L. DeRose, B. Homer, and D. Johnson: Detecting application load imbalance on high end massively parallel systems. Euro-Par 2007 Parallel Processing Conference, 2007, 150–159.
9. G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Press, ISBN 978-1439811924, July 2010.